
Conversation

@victor-eds
Contributor

Detect basic shuffles and lower them to `gpu.shuffle` operations. Basically, support cases in which we go from each work-item holding a single tensor element to each work-item holding `sub_group_size` tensor elements, such that element `i` corresponds to the element originally held by work-item `i` in the sub-group.

The upstream MLIR pass should handle all integer and floating-point types. Drop the code handling type legalization for those types once that is done. Pointer types should still be handled in this project.

The code should be extended to support other kinds of shuffles.

The multi-sub-group case is not yet implemented.
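
For illustration only, here is a minimal MLIR sketch of the per-lane index shuffle such a lowering builds on. This is not the actual pass output; the function name, the `f32` element type, and the constant sub-group size of 16 are assumptions made for the example.

```mlir
// Hypothetical sketch: fetch the f32 value originally held by work-item
// %lane of the current sub-group using an index shuffle. Repeating this for
// every lane (0 .. sub_group_size - 1) gives each work-item the full set of
// sub_group_size elements described above.
func.func @gather_from_lane(%src: f32, %lane: i32) -> f32 {
  // Assumed sub-group size of 16 for this example.
  %width = arith.constant 16 : i32
  // %res is the value contributed by work-item %lane; %valid indicates
  // whether %lane was inside the shuffle width.
  %res, %valid = gpu.shuffle idx %src, %lane, %width : f32
  return %res : f32
}
```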

@victor-eds
Contributor Author

Part of #2266.

@chengjunlu
Contributor

Is this urgent for the OKR performance goals?
If it is not urgent, could we upstream this change to Triton first and then pull it into the downstream?

@victor-eds
Contributor Author

> Is this urgent for the OKR performance goals? If it is not urgent, could we upstream this change to Triton first and then pull it into the downstream?

I agree this should be upstreamed. However, it isn't generic enough IMO, and I would like to work on it more before upstreaming. I would rather have it merged here first and a generic version upstreamed later. WDYT?

@etiotto linked an issue Oct 23, 2024 that may be closed by this pull request
@chengjunlu
Contributor

> > Is this urgent for the OKR performance goals? If it is not urgent, could we upstream this change to Triton first and then pull it into the downstream?
>
> I agree this should be upstreamed. However, it isn't generic enough IMO, and I would like to work on it more before upstreaming. I would rather have it merged here first and a generic version upstreamed later. WDYT?

Makes sense. We can make it more general gradually in the downstream first.

@victor-eds enabled auto-merge (squash) October 24, 2024 10:54
@victor-eds disabled auto-merge October 24, 2024 10:57
@victor-eds enabled auto-merge (squash) October 24, 2024 10:59
@victor-eds disabled auto-merge October 24, 2024 10:59
@victor-eds enabled auto-merge (squash) October 24, 2024 11:38
@victor-eds merged commit 6647f59 into intel:main Oct 24, 2024
4 checks passed

Successfully merging this pull request may close these issues:

Port "sub-group transpose reduction" to default path